PLOTTING USING SEABORN

Week 5 Part 2 of PubH 6852

Author: Dr JH Klopper

Affiliation: Department of Biostatistics and Bioinformatics

Introduction

Seaborn is a Python plotting package built on matplotlib. It automates much of the code that is required for statistical plots. In this notebook, we look at typical plots that you may want to use in your own work.

Packages used in this notebook

import pandas
import seaborn
%config InlineBackend.figure_format = 'retina'

There are various plotting themes available in the seaborn package. A theme sets the overall look of a plot. More about the set_style function, which sets the plot style, can be found in the seaborn documentation. We set the style argument of the set_style function to 'whitegrid' for this notebook.

seaborn.set_style(style='whitegrid')
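
To see what a theme changes, we can inspect the settings behind a style. The sketch below uses the axes_style function of seaborn, which returns the parameters of a named style as a dictionary. The variable name style_params is only illustrative.

```python
import seaborn

# Parameters that the 'whitegrid' style would apply to a plot
style_params = seaborn.axes_style('whitegrid')

# Two of the settings: whether a grid is drawn, and the axes background color
print(style_params['axes.grid'])
print(style_params['axes.facecolor'])
```

The other built-in style names are 'darkgrid', 'dark', 'white', and 'ticks', and each can be inspected in the same way.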

Data

We use the heart_failure.csv spreadsheet file again in this notebook and import it as a pandas dataframe object assigned to the variable df.

# Import spreadsheet file
df = pandas.read_csv('heart_failure.csv')

The column headers (variables) are listed using the columns attribute.

# Columns in the data file
df.columns
Index(['age', 'anemia', 'creatinine_phosphokinase', 'diabetes',
       'ejectionfraction', 'hypertension', 'platelets', 'serum_creatinine',
       'sodium', 'sex', 'smoking', 'time', 'death'],
      dtype='object')

Plots are selected based on the data type of the variable(s) that we want to visualize. The seaborn package divides its high-level plots into three categories: relational, distribution, and categorical plots. Each category is covered in a section below.
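
As a rough aid for deciding which variables are categorical, we can count the unique values in each column. The sketch below uses a small hypothetical dataframe (the rows are invented for illustration and are not taken from heart_failure.csv) and an assumed rule of thumb that a column with only two unique values is a binary categorical variable.

```python
import pandas

# Hypothetical rows that mimic three of the columns in the data file
example = pandas.DataFrame({
    'age': [75.0, 55.0, 65.0, 50.0, 65.0, 90.0],
    'platelets': [265000.0, 263358.0, 162000.0, 210000.0, 327000.0, 204000.0],
    'anemia': [0, 0, 0, 1, 1, 1],
})

# Number of unique values per column
unique_counts = example.nunique()

# Assumed rule of thumb: two unique values indicate a binary categorical variable
binary_columns = unique_counts[unique_counts == 2].index.tolist()
print(binary_columns)  # ['anemia']
```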

Relational plots

Relational plots visualize the correlation between continuous variables.

The seaborn package lists scatter plots and line plots as relational plots. In Figure 1, we use the scatterplot function to create a scatter plot of the \texttt{age} vs. the \texttt{platelets} variables to visualize the correlation between these two continuous variables.

seaborn.scatterplot(
    data=df,
    x='age',
    y='platelets'
);
Figure 1: Age vs. platelet count

The correlation between continuous variables can be visualized separately for each unique element of a categorical variable. The hue argument can be set to a categorical variable. Each class (unique element) of the categorical variable will be colored differently. We could also add the style argument and set its value to a categorical variable. This adds more visual contrast by using different marker styles for each class of the categorical variable. In Figure 2, we specify both arguments and set their values to the \texttt{anemia} variable.

seaborn.scatterplot(
    data=df,
    x='age',
    y='platelets',
    hue='anemia',
    style='anemia'
);
Figure 2: Age vs. platelet count for those with and without anemia
Task

Create a scatter plot to visualize the correlation between the \texttt{platelets} and \texttt{sodium} variables, colored by the unique elements of the \texttt{diabetes} variable.

seaborn.scatterplot(
    data=df,
    x='platelets',
    y='sodium',
    hue='diabetes'
);

Instead of using the classes of a categorical variable, we can visualize a third continuous variable using the hue argument. When the variable is continuous, its values determine the color of the markers. In Figure 3, we add the values of the \texttt{sodium} variable.

seaborn.scatterplot(
    data=df,
    x='age',
    y='platelets',
    hue='sodium'
);
Figure 3: Age vs. platelet count and sodium level

The size argument uses marker size instead of color to visualize a third continuous variable. In Figure 4, we choose the size argument instead of the hue argument for \texttt{sodium}.

seaborn.scatterplot(
    data=df,
    x='age',
    y='platelets',
    size='sodium'
);
Figure 4: Age vs. platelet count and sodium level

Distribution plots

Histograms can be used to visualize the frequency of data values. A histogram generates intervals (bins) and counts the occurrences of continuous values in each interval.

The displot function can produce a histogram. In Figure 5, we plot a default histogram of the \texttt{age} variable.

seaborn.displot(
    df,
    x='age'
);
Figure 5: Age distribution

In Figure 6, we specify the bin intervals using the range function. The start value is 40, the stop value is 110 (which is excluded), and the step size is 10, so the bin edges run from 40 to 100. This produces a histogram along the age decades, which is easier to read.

seaborn.displot(
    df,
    x='age',
    bins=range(40, 110, 10)
);
Figure 6: Age distribution
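
The bin edges that the range function generates can be listed directly. The short sketch below shows that the stop value is not included, and how consecutive edges pair up into intervals. The names edges and intervals are only illustrative.

```python
# Bin edges produced by the range call used for the histogram
edges = list(range(40, 110, 10))
print(edges)  # [40, 50, 60, 70, 80, 90, 100]

# Consecutive edges pair up into the intervals (bins)
intervals = list(zip(edges[:-1], edges[1:]))
print(intervals[0], intervals[-1])  # (40, 50) (90, 100)
```
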
Task

Generate a histogram to visualize the distribution of the \texttt{platelets} variable. Calculate the minimum and the maximum value for the variable and use the range function to create intervals of 50000. Note that the tick values on the horizontal axis seem overcrowded. Try to use the argument y instead of x. This will create a horizontal histogram, with more space for the tick values.

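# Minimum and maximum platelet counts
df.platelets.min()
df.platelets.max()
# Bin edges in steps of 50000 (the stop value of range is excluded)
seaborn.displot(
    df,
    y='platelets',
    bins=range(0, 850000, 50000)
);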
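
Because the stop value of range is excluded, it helps to derive the edges from the minimum and maximum so that no value falls outside the last bin. The following is a minimal sketch of one way to do this. The function bin_edges and the list of platelet counts are hypothetical and are not part of the notebook or the data file.

```python
import math

def bin_edges(values, width):
    """Return bin edges of the given width that cover every value."""
    start = math.floor(min(values) / width) * width
    stop = math.ceil(max(values) / width) * width
    # range excludes its stop value, so step one width past the last edge
    return list(range(start, stop + width, width))

# Hypothetical platelet counts, invented for illustration
counts = [47000, 263000, 621000, 850000]
edges = bin_edges(counts, 50000)
print(edges[0], edges[-1])  # 0 850000
```

The resulting list can be passed to the bins argument in place of the range call.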

In Figure 7, we show a histogram for each of the classes in the \texttt{anemia} variable using the hue argument.

seaborn.displot(
    df,
    x='age',
    bins=range(40, 110, 10),
    hue='anemia'
);
Figure 7: Age distribution for those with and without anemia

Overlaid histograms can be difficult to read. If we are only interested in the combined frequency, but still want to visualize the proportions, we can set the multiple argument to 'stack' to produce a stacked histogram, shown in Figure 8 below.

seaborn.displot(
    df,
    x='age',
    bins=range(40, 110, 10),
    hue='anemia',
    multiple='stack'
);
Figure 8: Age distribution for those with and without anemia

It may be better to produce separate histograms for each of the classes. We achieve this using the col argument. In Figure 9, we have separate histograms for each class of the \texttt{anemia} variable.

seaborn.displot(
    df,
    x='age',
    bins=range(40, 110, 10),
    col='anemia'
);
Figure 9: Age distribution for those with and without anemia

The stat argument can be set to 'probability' to visualize the relative frequency instead of the frequency. In Figure 10, we see the relative frequency version of Figure 9 above.

seaborn.displot(
    df,
    x='age',
    bins=range(40, 110, 10),
    col='anemia',
    stat='probability'
);
Figure 10: Age distribution for those with and without anemia
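
The relationship between frequency and relative frequency can be checked by hand. The sketch below uses the histogram function in numpy on a few hypothetical ages (invented for illustration) and divides each bin count by the total count.

```python
import numpy

# Hypothetical ages, invented for illustration
ages = numpy.array([42, 45, 51, 58, 60, 63, 65, 70, 72, 85])

# Frequency per decade, using the same bin edges as the histograms above
counts, edges = numpy.histogram(ages, bins=list(range(40, 110, 10)))

# Relative frequency: each count divided by the total count
relative = counts / counts.sum()
print(counts)    # [2 2 3 2 1 0]
print(relative)
```

The relative frequencies sum to 1, which is what distinguishes them from the raw counts.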

Heat maps can be used to visualize the distribution of two continuous variables. We add a y argument to the displot function to visualize bivariate distributions. In Figure 11 we visualize the \texttt{age} and the \texttt{platelets} variables. The cbar argument adds a color bar.

seaborn.displot(
    df,
    x='age',
    y='platelets',
    cbar=True
);
Figure 11: Age and platelet distributions
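
The idea behind a bivariate histogram is that observations are counted in rectangular bins, and each count is mapped to a color. The sketch below illustrates the counting step with the histogram2d function in numpy. The paired values and the bin edges are hypothetical, chosen only to keep the example small.

```python
import numpy

# Hypothetical paired observations, invented for illustration
ages = numpy.array([45, 52, 58, 61, 67, 73, 79, 88])
platelets = numpy.array(
    [210000, 305000, 180000, 260000, 275000, 150000, 320000, 340000]
)

# Count the observations in a 2 x 2 grid of rectangular bins
counts, age_edges, platelet_edges = numpy.histogram2d(
    ages,
    platelets,
    bins=[[40, 65, 90], [100000, 250000, 400000]]
)

# Rows follow the age bins and columns follow the platelet bins
print(counts)
```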

A kernel density estimate adds smoothing to produce heat areas. We add the kind argument in Figure 12 and set it to 'kde'. We also add a rug plot (tick marks along the axes for each observation) using the rug argument with a value of True.

seaborn.displot(
    df,
    x='age',
    y='platelets',
    kind='kde',
    rug=True
);
Figure 12: Age and platelet distributions

The jointplot function can combine different visualizations of the same data by adding plots in the margins of the figure. By default, as in Figure 13, it draws a scatter plot of two continuous variables and adds a histogram of each variable in the margins.

seaborn.jointplot(
    df,
    x='age',
    y='platelets'
);
Figure 13: Age and platelet distributions

We can take more control over the joint plot and the marginal plots by assigning a JointGrid object to a variable and calling the plot_joint and plot_marginals methods on the JointGrid object. In Figure 14, we add box-and-whisker plots to the margins.

g = seaborn.JointGrid(
    data=df,
    x='age',
    y='platelets'
)
g.plot_joint(seaborn.histplot)
g.plot_marginals(seaborn.boxplot);
Figure 14: Age and platelet distributions

Categorical plots

Although box-and-whisker plots visualize the distribution of a continuous variable, seaborn lists them as plots for categorical data. They are often used to compare the distribution of a continuous numerical variable between the unique elements of a categorical variable. The catplot function is used for a variety of plots for categorical data.

In Figure 15, we see a box-and-whisker plot of the \texttt{age} variable, for those with and without diabetes (unique elements in the \texttt{diabetes} variable). The kind argument is set to 'box', to indicate a box-and-whisker plot.

seaborn.catplot(
    data=df,
    x='diabetes',
    y='age',
    kind='box'
);
Figure 15: Age distribution among those with and without diabetes
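
The box in a box-and-whisker plot spans the first to the third quartile, with a line at the median. These values can be calculated per class with pandas, as in the sketch below. The rows are hypothetical and are not taken from the data file.

```python
import pandas

# Hypothetical rows, invented for illustration
example = pandas.DataFrame({
    'diabetes': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    'age': [45, 50, 60, 65, 70, 42, 55, 58, 72, 80],
})

# Quartiles of age within each class of the diabetes variable
quartiles = (
    example.groupby('diabetes')['age']
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```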

In Figure 16 we add a second categorical variable, \texttt{death}, using the hue argument.

seaborn.catplot(
    data=df,
    x='diabetes',
    y='age',
    hue='death',
    kind='box'
);
Figure 16: Age distribution among those with and without diabetes comparing survivor groups

For larger data sets, setting the kind argument to 'boxen' gives a better indication of the distribution of the values of the continuous variable. Figure 17 visualizes the same data as Figure 16, with the kind argument set to 'boxen'.

seaborn.catplot(
    data=df,
    x='diabetes',
    y='age',
    hue='death',
    kind='boxen'
);
Figure 17: Age distribution among those with and without diabetes comparing survivor groups

Violin plots use a kernel density estimate to create the shape of the plots. This gives an even richer visualization of the distribution of the continuous variable. In Figure 18, we revisit the data of Figure 16, but as a violin plot, setting the kind argument to 'violin'.

seaborn.catplot(
    data=df,
    x='diabetes',
    y='age',
    hue='death',
    kind='violin'
);
Figure 18: Age distribution among those with and without diabetes comparing survivor groups

To reduce the number of shapes, we can split the violin plots by the unique values of a binary variable such as \texttt{death}, using the split argument. This is shown in Figure 19 below.

seaborn.catplot(
    data=df,
    x='diabetes',
    y='age',
    hue='death',
    kind='violin',
    split=True
);
Figure 19: Age distribution among those with and without diabetes comparing survivor groups

A number of other visualizations can be created using catplot. In Figure 20, we see a swarm plot, which is a type of categorical scatter plot. The kind argument is set to 'swarm'.

seaborn.catplot(
    data=df,
    x='diabetes',
    y='age',
    kind='swarm'
);
Figure 20: Age distribution among those with and without diabetes

Bar plots are quintessential plots for categorical data, showing the frequency of the unique elements of a categorical variable. In Figure 21, we see the frequency of those with and without diabetes. This is achieved by setting the kind argument to 'count'.

seaborn.catplot(
    data=df,
    x='diabetes',
    kind='count'
);
Figure 21: Frequency of those with and without diabetes
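
The bar heights in a count plot are the frequencies of the classes. As a check, these can be calculated with the value_counts method in pandas. The values below are hypothetical and only illustrate the calculation.

```python
import pandas

# Hypothetical values, invented for illustration
diabetes = pandas.Series([0, 1, 0, 0, 1, 0, 1, 0], name='diabetes')

# Frequency of each class, ordered by class label
frequencies = diabetes.value_counts().sort_index()
print(frequencies.to_dict())  # {0: 5, 1: 3}
```
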
Task

We can add another categorical variable by using the hue argument and setting a categorical variable as its value. Create a bar plot using the \texttt{diabetes} variable as in Figure 21, but add the \texttt{death} variable as well.

seaborn.catplot(
    data=df,
    x='diabetes',
    hue='death',
    kind='count'
);
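
The frequencies behind such a grouped bar plot form a contingency table. The sketch below builds one with the crosstab function in pandas, using hypothetical rows that are not taken from the data file.

```python
import pandas

# Hypothetical rows, invented for illustration
example = pandas.DataFrame({
    'diabetes': [0, 0, 0, 0, 1, 1, 1, 1],
    'death': [0, 0, 1, 0, 1, 0, 1, 1],
})

# Rows are the diabetes classes and columns are the death classes
table = pandas.crosstab(example['diabetes'], example['death'])
print(table)
```

Each cell of the table corresponds to the height of one bar.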

The last type of categorical plot that we consider in this notebook is the point plot. It visualizes the difference in the mean of a numerical variable for the unique elements of a categorical variable. A point plot also shows the 95\% confidence interval around the mean. We can also use the hue argument to show individual point plots for the classes of a second categorical variable. In Figure 22, we visualize the difference in the \texttt{age} variable between those with and without diabetes, for each of the survivors and non-survivors. The two categorical variables are \texttt{diabetes} and \texttt{death}.

seaborn.catplot(
    data=df,
    x='diabetes',
    y='age',
    hue='death',
    kind='point'
);
Figure 22: Age difference between those with and without diabetes comparing survivor groups
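
By default, seaborn estimates the confidence interval in a point plot by bootstrapping. The sketch below mimics the idea with numpy for a single group. It is an illustration under stated assumptions and not the code that seaborn runs. The ages are randomly generated, and the number of resamples (1000) is our choice.

```python
import numpy

rng = numpy.random.default_rng(42)

# Hypothetical ages for one group, generated only for illustration
ages = rng.normal(loc=60, scale=12, size=50)

# Resample with replacement and record the mean of each resample
boot_means = numpy.array([
    rng.choice(ages, size=ages.size, replace=True).mean()
    for _ in range(1000)
])

# The 2.5th and 97.5th percentiles bound a 95% interval for the mean
lower, upper = numpy.percentile(boot_means, [2.5, 97.5])
print(round(ages.mean(), 1), round(lower, 1), round(upper, 1))
```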

Setting plot and axes labels

A simple way to add a plot title and axes labels is to call the set method on any plot. The set method accepts the arguments xlabel, ylabel, and title. Each argument takes a string as its value. In Figure 23, we add a title and labels for both axes.

seaborn.catplot(
    data=df,
    x='diabetes',
    y='age',
    hue='death',
    kind='point'
).set(
    xlabel='Diabetes', # Add a label to the horizontal axis
    ylabel='Age', # Add a label to the vertical axis
    title='Difference in age between those with and without diabetes per survivor group' # Add a title
);
Figure 23: Age difference between those with and without diabetes comparing survivor groups
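
The same approach works for functions such as scatterplot, which return a matplotlib Axes object with its own set method. The sketch below uses a small hypothetical dataframe and reads the labels back to confirm that they were applied. The Agg backend is selected so that the code runs outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, an assumption for running as a script
import pandas
import seaborn

# Hypothetical rows, invented for illustration
example = pandas.DataFrame({
    'age': [45, 60, 75],
    'platelets': [210000, 265000, 180000],
})

ax = seaborn.scatterplot(data=example, x='age', y='platelets')
ax.set(
    xlabel='Age (years)',
    ylabel='Platelet count',
    title='Age vs. platelet count'
)

# Read the labels back from the Axes object
print(ax.get_xlabel(), '|', ax.get_ylabel(), '|', ax.get_title())
```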